knitr::opts_chunk$set(message=FALSE, warning=FALSE)
The goal of this analysis is to figure out whether shootings in Baltimore, Maryland have been getting worse over the years. “Worse,” in this case, means “more severe”; later in the analysis, we will define exactly what we use as a measure of severity.
Let’s pull some data into R. For this tutorial we’re going to use data about gun violence from the past few years. These data are stored locally as a comma-separated values (CSV) file, so we can use the read.csv function to load them into a data frame.
library(tidyverse)
library(broom)
library(leaflet)
library(leaflet.extras)
gun_violence <- read.csv("gun-violence-data_01-2013_03-2018.csv")
gun_violence
Now that our data are loaded, we can start exploring the data set. Before we begin, we should consider our goal for this data analysis and decide which information in the data set matters for what we’re trying to figure out. The goal here, in broad terms, is to determine whether shootings in Baltimore have gotten generally “better” or “worse” over the last few years.
So which parts of this data set do we need? A better question might be: which parts don’t we need? It’s a lot easier to work with the data when you’re not forced to wade through lots of irrelevant information. Let’s figure out what we can drop from our data set:
gun_violence <- gun_violence %>%
  select(-c(gun_stolen, incident_url_fields_missing, gun_type, source_url,
            incident_url, incident_characteristics, location_description,
            notes, sources, participant_type, participant_status,
            participant_relationship, participant_name, participant_gender,
            participant_age_group))
gun_violence
The above code removes a bunch of attributes that aren’t really useful to us (mostly URLs and columns that are missing values for most of the rows of the table). We accomplish this using the select function. Another interesting thing we did here is use the pipe operator (%>%). This operator passes the object on its left as the first argument to the function on its right, and it is a very useful tool found in the tidyverse library.
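To make the pipe concrete, here is a tiny self-contained example (using a toy data frame, not our gun violence data) showing that `x %>% f(y)` is simply another way of writing `f(x, y)`:

```r
library(dplyr)

# A toy data frame, just to demonstrate the pipe
toy <- data.frame(a = 1:3, b = c("x", "y", "z"))

# These two calls produce identical results:
no_pipe   <- select(toy, a)      # classic function-call style
with_pipe <- toy %>% select(a)   # pipe style: toy becomes the first argument

identical(no_pipe, with_pipe)    # TRUE
```

The pipe really pays off when chaining several transformations together, as we do throughout this analysis.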
Okay, so now that we have only the attributes we care about, let’s get a little more focused. In this case, we only really care about the data for Baltimore, so let’s reduce the data set to show only shootings in Baltimore by applying the “filter()” function, as shown here:
gun_v_baltimore <- gun_violence %>% filter(state == 'Maryland') %>% filter(city_or_county == 'Baltimore County' | city_or_county == 'Baltimore')
gun_v_baltimore
Cool, now we have data from Baltimore only. Let’s see what information we can squeeze out of it. It’s time to start defining what it means for gun violence to truly get “worse”. We might first want to see whether the overall number of shootings per year has gone up or down in the last few years. To figure this out, we need some way of counting up the total number of shootings for a given year, which we can do with the “group_by()” function. Unfortunately, right now we don’t have a year attribute, just a date. Since we don’t really want to count up the shootings per day, let’s make a new attribute and assign the value of the year to it. We do this using the “mutate()” and “format()” functions:
gun_v_baltimore <- gun_v_baltimore %>%
  mutate(date = as.Date(as.character(date))) %>%
  mutate(year = format(date, "%Y"))
gun_v_baltimore
Now that we have our year column, let’s see if we can spot any immediate trends in the number of shootings over the years.
We’re going to do this by first grouping the data by year, then plotting the number of shootings per year. The “group_by()” function puts our data into groups based on an attribute, after which we can use the “summarize()” function to apply an aggregation to each group. We then use “ggplot()” to plot the result.
for_graph1 <- gun_v_baltimore %>%
group_by(year) %>%
summarize(tot_shootings = n())
for_graph1
ggplot(data=for_graph1, aes(x=factor(year), y=tot_shootings)) + geom_col()
Hmmm, this information seems a little bit odd. How can the number of shootings be so much lower in 2013? And what about 2018?
Well, 2018 is fairly easy to explain: we don’t have as much data because it’s still 2018!
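We can sanity-check this (assuming the scrape really does stop in early 2018, as the file name gun-violence-data_01-2013_03-2018.csv suggests) by looking at the range of dates recorded for 2018:

```r
# Earliest and latest incident dates recorded for 2018 — if the data set
# ends in March, the latest date should fall well before year's end.
gun_v_baltimore %>%
  filter(year == "2018") %>%
  summarize(first_date = min(date), last_date = max(date))
```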
But what about 2013? Those numbers CAN’T be right based on the other years, right? Let’s look at the 2013 data on its own and see if we can figure out what’s going on:
gun_v_baltimore %>% filter(year == "2013")
Going back through this data, it looks like the only incidents recorded for 2013 have relatively high injury counts. This makes sense: the data was scraped from the internet recently, so only the high-profile shootings from 2013 would still be easily obtainable online, while minor shootings are likely to get buried over time.
This creates a bit of a problem. Because this data was recently pulled from the internet, the earliest years are unreliable (at least for this statistic). We cannot trust the number of shootings per year to be correct, because not all shootings are accounted for in all years. We could leave this data in, or we could drop it and use only the other years. This isn’t ideal, but since the 2013 data (and the partial 2018 data) is so unreliable, let’s focus only on 2014–2017:
gun_v_baltimore <- gun_v_baltimore %>% filter(year %in% c("2014", "2015", "2016", "2017"))
gun_v_baltimore
Now that our data set is fully organized, it is ready to analyze. Since we’re trying to see whether shootings have been getting worse, let’s come up with a definition for the “severity” of a shooting. For simplicity’s sake, let’s say severity = number injured + 2 × number killed. Side note: it feels a little weird to quantify human life like that, but I guess I accidentally picked an inherently dark data set to work with here. Anyway, we’ll add this as a column in our table, then plot severity over time.
gun_v_baltimore <- gun_v_baltimore %>% mutate(severity = 2 * n_killed + n_injured)
gun_v_baltimore %>% ggplot(aes(x=date, y=severity)) + geom_point()
Well, that was… underwhelming. It seems that the majority of shootings in this data set are (thankfully) not extremely severe. Most shootings appear to involve very few victims. It isn’t very easy to see a trend here, so let’s try to fit this graph with a linear regression line:
gun_v_baltimore %>% ggplot(aes(x=date, y=severity)) + geom_point() + geom_smooth(method="lm")
Even after fitting a regression line, we can see very little slope in the line itself, which suggests only an incredibly small uptick in the severity of shootings over time.
Let’s try this again, but using the average severity for each year:
avgs <- gun_v_baltimore %>% group_by(year) %>% summarize(avg_severity=mean(severity))
gun_v_baltimore <- gun_v_baltimore %>% group_by(year) %>% mutate(avg_severity=mean(severity)) %>% ungroup()
avgs %>% ggplot(aes(x=as.Date(paste0(year, "-01-01")), y=avg_severity)) + geom_point() + geom_smooth(method="lm")
Using this information, we can start to make some interesting predictions about the future. It definitely seems like the average severity of shootings is staying about the same, but what if we want to know specifics about 2018 or some other year in the future? Let’s try building a linear model from this information:
linearfit <- lm(formula=avg_severity~date, data=gun_v_baltimore)
linearfit %>% tidy()
So, from our linear model, it looks like the severity of shootings in Baltimore is increasing at a rate of about .0000452 per day (note that because “date” is a Date, lm() reports its slope per day, not per year). Obviously, this is an absolutely minuscule rate of change. If this trend continues, we can expect to see roughly the same amount of gun violence in Baltimore over the coming years.
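To double-check those units, here is a small self-contained illustration (toy data, not our data set) showing that lm() on a Date column reports its slope per day, so a rough per-year figure requires multiplying by about 365:

```r
# Toy data: y increases by exactly 1 each year
d <- data.frame(
  date = as.Date(c("2014-01-01", "2015-01-01", "2016-01-01")),
  y    = c(1, 2, 3)
)
fit <- lm(y ~ date, data = d)

coef(fit)[["date"]]        # per-day slope, about 1/365
coef(fit)[["date"]] * 365  # about 1 unit per year
```

Applying the same conversion to our model’s slope gives the annualized rate of change in severity.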
So, to answer our initial question about whether or not gun violence is getting worse: yes, but very, very minimally. So minimally, in fact, that assuming any meaningful increase over the next year would be foolish.
This conclusion is likely somewhat faulty: the data used covers only four years (2014–2017), and the data set it was pulled from certainly wasn’t curated very well. Unfortunately, I didn’t start to realize this until I was way too far into the project to turn back, so this didn’t quite come out the way it really should have. There is likely a lot more to say about this data and gun violence in America, but it certainly shouldn’t be concluded from this data set alone. What this data does suggest, however, is that although media coverage of mass shootings over the last few years makes it seem like the average shooting is getting worse over time, those mass shootings appear to be outliers: the overall severity trend (at least for the small subset we looked at here) has not seen much of an increase.